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Abstract 

Connectionist feed-forward networks, trained with back-propagation, can be used both for non- 
linear regression and for (discrete one-of-C) classification, depending on the form of training. This 
report contains two papers on feed-forward networks. The papers can be read independently. 
They are intended for the theoretically-aware practitioner or algorithm- designer, however, they 
also contain a review and comparison of several learning theories so provide a perspective for the 
theoretician. The first paper works through Bayesian methods to complement back-propagation in 
the training of feed-forward networks. The second paper addresses a problem raised by the first: 
how to efficiently calculate second derivatives on feed-forward networks. 
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Chapter 1 

Bayesian Back-Propagation 


Connectionist feed-forward networks, trained with back-propagation, can be used both 
for non-linear regression and for (discrete one-of-C) classification, depending on the 
form of training. This paper works through approximate Bayesian methods to both 
these problems. Methods are presented for various statistical components of back- 
propagation: choosing the appropriate cost function and regularizer (interpreted as a 
prior), eliminating extra weights, estimating the uncertainty of the remaining weights, 
predicting for new patterns (“out-of-sample”), estimating the uncertainty in the choice 
of this prediction (“error bars”), estimating the generalization error, comparing differ- 
ent network structures, and adjustments for missing values in the training patterns. 
These techniques refine and extend some popular heuristic techniques suggested in the 
literature, and in most cases require at most a small additional factor in computation 
during back-propagation, or computation once back-propagation has finished. The pa- 
per begins with a comparative discussion of Bayesian and related frameworks for the 
training problem. 


1.1 Introduction 

Back-propagation [RHW86] is a popular scheme for training feed-forward connectionist net- 
works. Recent improvements have looked at tasks such as speeding up learning using more 
sophisticated gradient descent algorithms [BL88, EJM90], eliminating insignificant weights 
in order to find networks of an optimal size [LDS90, WHR90, WRH91], and various modi- 
fications to apply a network to a one-of-C classification task (prediction of a single discrete 
variable taking one of C mutually exclusive and exhaustive values) rather than a non-linear 
regression task (prediction of a single real variable) [Bri89, DL91]. 

There is much to be gained by looking at the intersection between probabilistic tech- 
niques and theory, and feed-forward connectionist networks. On the one hand, feed-forward 
networks offer a broad computational framework for implementing a wide variety of prob- 
abilistic functions, and on the other hand, probability theory offer techniques for refining 
back-propagation schemes. We consider both these issues in this paper. 

Since Bayesian probabilistic methods for learning tasks sometimes prove superior to 
other approaches on smaller samples [Bun90, Ber85, Bun91, 0H91], we here frame the 


2 



Bud tine and Weigend 


3 


probabilistic component of back-propagation in a Bayesian context. Back-propagation algo- 
rithms are known to work well in a variety of noisy and uncertain domains using a variety 
of different network structures and activation functions, so we propose Bayesian modifica- 
tions and extensions to existing methods. Of course, in adopting a Bayesian justification for 
the methods presented, we are not claiming any neurological validity for our methods. We 
view feed-forward networks as a vehicle for massive parallelism in classification, regression 
and learning. This occurs because feed-forward networks can be constructed from simple 
component parts but still are able to represent a broad range of functions. 

The beauty of the Bayesian approach arises from the fact that the entire method is 
derived as an approximation to applying the single simple formula 

posterior a prior • likelihood 

to the training problem, together with some principles for reasoning about the prior knowl- 
edge available. 

Section 1.2 gives some theoretical background and motivation for using Bayesian meth- 
ods, and discusses some alternative approaches. Section 1.3 details the notation and form 
of networks considered here. Section 1.4 outlines some network forms that correspond to 
variations of standard probability functions. A good feed-forward network system should 
therefore subsume the tasks of several special purpose statistical systems, albeit at some 
computational cost. Section 1.5 presents a probabilistic analysis of the training of feed- 
forward neural networks. Section 1.6 uses these results to present Bayesian embellishments 
for the standard back-propagation algorithm: cost functions, and weight-evaluation mea- 
sures. Several minimum encoding approaches [WF87, BC91, Ris87] are also explained in 
Section 1.6.3. Finally, Section 1.7 discusses some extensions to back-propagation involving 
weight-elimination strategies, prediction of variables and generalization error, and handling 
missing values. 


1.2 On Bayesian methods 

Bayesian methods described here are based on estimating posterior probabilities for differ- 
ent network weights. The posterior corresponds to a relative measure of belief in the many 
possible network weights, similar to the statistical mechanics idea of an ensemble of net- 
works [LTS89]. Statistical mechanics theory has been developed in the context of training a 
perceptron in a noise-free environment to estimate the generalization error [0H91, SST91], 
While this is an impressive demonstration of the theory, we are concerned with developing 
Bayesian methods for more general feed-forward networks trained in noisy and/or uncertain 
environments, as commonly found in practice. 

Bayesian and statistical mechanics methods both share the view that since we can never 
determine the “correct” or “best” weights, we should carefully reason about the reasonable 
possibilities. In Bayesian practice, however, such as [Mac91] more care is required in the 
selection of the prior and the error model, etc. We present methods for that here. Bayesian 
methods carefully separate the components of the learning problem: the priors on networks 
and network weights which represent our expectations before any training and correspond to 
regularizes; the utility or loss function for the problem which represent the goal of learning 
and correspond to quantities such as minimum errors or least squares; and the likelihood 
function for the network, such as a “Gaussian error model” as used in regression, which 
gives a statistical model of the network and how the data is expected to have been generated 
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initially. Notice that in general there is no “correct” prior, error model or likelihood, since 
by definition these vary from problem to problem and it is a challenge to try and make a 
choice that seems appropriate in a given circumstance. Notions such as the “true” subjective 
probability function [G 0 I 88 ] or the “correct” likelihood function derived from the additive 
energy function [LTS89] are therefore only partially consistent with our general approach, 
though in several cases produce equivalent results. See also [Hau91] for a discussion in the 
uniform convergence framework. 

Bayesian methods should not be confused with recent results that back-propagation 
methods are Bayes optimal [HP90]. Bayes optimal refers to the property that back-propagation 
methods should produce a network approaching the greatest lower bound on error (or more 
generally, risk) as the training sample size approaches infinity. Bayesian methods described 
here share this property but also have a more powerful property: They are approximating 
a method that produces a classifier/regression that will on average have equal or lower er- 
ror (risk) than a classifier /regression produced by any other method applied to the same 
training sample. Notice, this holds for the current training sample, not just the infinite one 
considered by Bayes optimality. Also, notice, that this powerful property rests on certain 
assumptions, which are discussed below. This requires careful use of techniques in practice. 

These properties hold because Bayesian methods are “normative” or “rational” [Ber85, 
HHL 86 ]: any other approach not approximating them should not perform as well on aver- 
age. As an example, minimum encoding methods such as MML [WF87] and MDL [Ris87] 
are often viewed as approximate Bayesian methods. Accordingly, a more exact Bayesian 
approach convincingly outperforms these encoding methods on the task of learning tree 
classifiers [Bun90] made popular by Quinlan’s ID3 algorithm [Qui 86 ]. 

Uniform convergence methods [BH89, Hau91], the basis for computational learning the- 
ory, approximate Bayesian methods when the sample size is large [Bun91]. Uniform conver- 
gence methods require a sample size large enough so that the prior term found in Bayesian 
methods becomes insignificant (and so can be ignored). Bayesian methods for connectionist 
networks are therefore an important complement to these and can provide keener insight 
about learning from smaller samples to the algorithm designer. 

It is important to bear in mind, however, that Bayesian methods are not a replacement 
for uniform convergence methods, as indeed, they are not a replacement for other techniques 
such as cross-validation or minimum encoding methods, etc. All methods have particular 
algorithmic, complexity, and statistical properties that make them more appropriate in 
one engineering context or another. Cross-validation, for instance, can be fairly easy to 
implement on top of an existing algorithm so the algorithm can be got up and running 
quickly (even though its subsequent performance may be quite slow). 

The theory behind Bayesian methods rests on two critical assumptions: 

(I) the choice of models (or hypotheses) being searched must contain the “true” model (in 

practice this means a fairly accurate approximation to the ”true” model), 

(II) and the choice of prior over these models should represent a reasonable initial preference 
for models in the search space. 

Although there is considerable literature on how to choose a prior to minimize the assump- 
tions implicit in the prior [Ber85], and in practice we often consider a range of priors for their 
suitability [Mac91, Bun90]. Bayesian methods then guarantee best average case performance 
given these two assumptions and a third assumption: 

(III) that the approximations made in implementation are sufficiently accurate. 
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Poor modeling will lead to strong but inappropriate assumptions. For example, a single 
layer perceptron network cannot capture certain higher-order functions, so is inappropriate 
for such tasks. So Bayesians say “A good Bayesian will usually out-perform a non-Bayesian 
but a bad Bayesian can do far worse” [Goo83]. In contrast, uniform convergence methods 
make no assumptions (the model space does not have to contain the “true” model and no 
prior is needed) but can only guarantee good performance in the worst case, so they sacrifice 
the guarantee of average case optimality. Notice that for larger samples, the worst case and 
the average case converge but for smaller samples the worst and average case can be far 
different [Bun91]. 

Non-Bayesians often say that “Bayesian methods require the assumption of a prior on 
networks which may not be known in practice”. In fact, every algorithm or method is 
making implicit assumptions about a prior that can be explicitly calculated by running 
the algorithm on many different data sets and seeing the networks that result [Bun90]. In 
other words, some kind of prior assumption is unavoidable in practice, except where sample 
sizes are sufficiently large so that prior assumptions become submerged in the wealth of 
information from the sample. So rather than ignoring the unavoidable, Bayesian methods 
offer principles for developing a prior and evaluating a prior in a practical application. The 
intent of Bayesian methods then is to make all assumptions underlying a method explicit 
and then reconstructing the most rational method based on those assumptions. Techniques 
for making the Bayesian assumptions more robust also exist, and are referred to as robust 
statistics [Ber85], however we do not consider them here. 


1.3 Multi-Layer networks 

The networks we consider consist of a directed acyclic graph, giving the network structure, 
together with the activation functions at each unit or node, which relate inputs to activation 
output. A directed acyclic graph is one containing no directed cycles, so the function 
computed by the network is not computed using any fixed point equations. The network 
deals with real numbers internally, although the inputs may be discrete, represented as, say, 
reals 0 and 1. 

Denote the input variables to the network as a vector of values x and denote the re- 
sponse variable which the network is intended to predict as y. In the case of regression, the 
network output corresponds to the predicted regression for y given the input variables. This 
corresponds to the expected value or mean of the real valued variable y conditioned on the 
values of the input variables x. In the case of one-of-C classification, the network outputs 
a conditional probability distribution over C possible (mutually exclusive and exhaustive) 
values for the discrete variable y, conditioned on the values of the input variables. The 
output comes from n nodes and corresponds to a vector of real values summing to 1. The 
i-th value is the estimated conditional probability that the output variable should have the 
i-th discrete value. 

Connections (given as directed arcs on a graph) correspond to the flow of information 
between units or nodes. This can be in the forward direction during inference (classification 
or regression) and in the backward direction when calculating derivatives. So a connection 
from node n to m represents that node m receives node n’s activation during the inference 
stage, often multiplied with a weight i0 m ,„. The nodes in the graph with no incoming 
connections correspond to input nodes for the neural net. Their value is given to the system 
from some external source. 



Bun tine and Weigend 


6 


Let the non-output nodes in the network be indexed 1,2,.... In regression, the out- 
put node has activation o and in one-of-C classification the output nodes have activation 
The activation of any non-output node n is denoted u n which is a real num- 
ber. The activation function for the node m is a function of the activations u n for nodes n 
inputting to node m, and the function is parameterized by a vector of parameters w m . 

The exact details of the activation functions relating inputs to a nodes activation is not 
important for the analysis here, except that they are parameterized by network weights. A 
typical activation function for a node n with vector of weights tu n would be the sigmoid 
acting on the weighted sum of its inputs tii, . . . , ui f 




1 

l + e - “^i=« 


1 

2 




The slope of the sigmoid, a, can be absorbed into weights and bias without loss of generality 
and is set to one. The bias b n for the node is denoted t0„ t o* Some other activation functions 
that have been used are independent Gaussians found in radial basis functions, logistic, 
exponential and various trigonometric functions [Cot90]. 

In the case of one-of-C classification, the output activations need to be non-negative and 
sum to one. To do this, the output activation functions may normalize a set of non-negative 
activations from nodes in the previous layer ti*, . . ., uc, using the function 


Oi(tii,...,U C ) = = . 

£*= i c 

We call this function a normalizing activation function. A related activation function is the 
Softmax function of Bridle [Bri89] given by 





1.4 Probabilistic neural networks 

Two key issues in network design are whether the network can sufficiently approximate 
the “true” function that needs to be computed, and whether the network structure and 
weights has an interpretation or intrinsic form that can be explained as the “knowledge” 
in the network rather than treating the network as a black box. This second issue is more 
important than one might think. If network weights have some clear interpretation then we 
can configure the network in coherent ways, an expert can view the knowledge represented 
by the network, and we can more readily assign meaningful priors or some other initialization 
for the network weights [TSN90]. Probabilistic interpretations of networks is also discussed 
in [Bri89]. 

In this section, we present network structures where these issues are well understood. 
One of the advantages of the general network approach, however, is that the one network 
package can be used to implement all these network types and more, just be replacing the 
network activation functions and their derivatives. While we call these probabilistic neural 
networks, they should not be confused with the probabilistic networks that are used in 
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artificial intelligence [Pea88]. The neural networks are probabilistic in the sense that they 
implement a well-understood probability function. 


1.4.1 Logistic networks 

Suppose all variables are discrete, the problem is classification, and the problem is distin- 
guished by certain marginal probabilities of key features. For instance, for boolean variables 
xo, ®i, the key features might be *o®3» ®i®4, etc., and the distinguishing char- 

acteristics of the problem are the marginal probabilities Pr(xoX3)» Pr(x 1X4), Pr(x 3X4), etc. 
All other marginal probabilities are assumed to to follow somehow from these. A standard 
method of inference is to find the distribution which assumes the least amount of information 
apart from the importance of the given marginals. This is termed the maximum entropy 
argument and the class of distributions so derived belong to the general exponential family 
[MN89], of which the Gaussian and logistic functions are a special case. 

Suppose the key features are represented by boolean functions ci,...,c/ on the input 
and output variables (x,y) that return 1 if the feature holds and 0 otherwise. Then the 
exponential distribution takes the form 


Pr(x, y) = exp I a 0 + ^ a ,c,(x, y) 

\ i=l } 

where ao is a normalizing constant and a are some parameters that determine Pr(c,(x, y)). 
Since we are interested in the conditional probability distribution Pr(y | x), we take the 
conditional version of this to get 



Pr(y 1 x) 



( 1 . 1 ) 


For y boolean this can be further simplified to 


Pr(y I x) 


1 

1 + «cp (E <= 1 / «<(*(*> 0) - !))) 


which is the general case of multivariate logistic regression (or logit) [Ame85]. In the simplest 
case with boolean variables, the distinguishing features are yx;, yx7, yx, and yx 7 for the 
input variables x;. That is, the distribution is fully determined by the probabilities of 
Pr(x, | y). The distribution reduces to the form 


Pr(v I *) = 7 V ’ 

l + ex P (A) + E.=i «A*.) 

thus recovering the sigmoid function. More complex cases introduce higher-order terms into 
the sum such as X1X5, and X2X3X5, etc. By introducing enough higher-order terms, any 
conditional probability distribution on boolean variables can be represented. In practice, 
higher-order terms would have to be carefully chosen to represent those kinds of features 
expected in the data. 

These functional forms can be implemented as networks in the following manner. For 
a boolean output variable y, one hidden layer computes the higher order terms from the 
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input variables and the output node takes the weighted sum of these through a sigmoid 
function. In the general case there would have to be a second hidden layer after the first 
to compute the exponentials in Equation (1.1) and the final output layer has normalizing 
activation functions. We refer to both these kinds of networks as logistic networks. 

1.4.2 Cluster networks 

Cluster networks perform one-of-C classification, so they output a probability vector for the 
discrete output variable y. In the case of discrete inputs, cluster networks can be viewed as 
an extension of logistic networks where the higher-order terms in the network are discovered 
by the network during training of the first layer of hidden nodes. The first layer of hidden 
nodes detect hidden mutually exclusive and exhaustive classes in the data of the form usually 
found using unsupervised learning or clustering techniques [Pan89]. For instance, if input 
variables are all boolean then each node in the first layer can correspond to the partial 
occurrence of a particular noisy conjunct. If input variables are continuous, then nodes in 
the first layer can correspond to radial basis nodes. The detected classes are, however, fed 
into the second hidden layer which then constructs an unnormalized probability vector for 
y, to be normalized by the final output nodes. 

Cluster networks implement a probability distribution based on the assumption that 
there is a hidden variable making all other variables independent. We assume there is a 
hidden variable z with K mutually exclusive and exhaustive values numbered 1 to K. All 
other variables, the input variables xi,...,xx and the output variable y are independent 
given z , so the model for the joint probability of z, ®i, . . . , z A , y is of the form 

Pr(z,zi,.. M x A ,y) = Pr(z)Pr(y\z) JJ Pr(xi \ z) . 

For K large enough and all variables discrete, this family of distributions can arbitrarily 
approximate any other. The hidden variable z in practice has an unknown value and we are 
interested in the conditional probability of y given x, so we manipulate this joint to produce 

Pr(v | x) = k Pr (v I z ~ i) Pr ( z = i) rL=w Pr ( x i I z = i) 

= j) IUw Pr ( x < I * = i) 

Notice that if the value for x,- is missing, then the corresponding entry in the product 
can be dropped and the formula still holds. If a variable x,* is continuous, we can 
let Pr(xi | z = j) be a Gaussian distribution with mean and standard deviation aj. 
In practice, we might want to use a totally different distribution for x*. The current choice 
is sufficient for presentation. 

The following network implements this family of distributions. The network has two hid- 
den layers with full interconnections between layers. The output nodes (numbered layer 3) 
have normalizing activation functions, and represent the normalized conditional probability 
vector for y given x. The second hidden layer has C nodes, one for each value of the output 
variable y, and has linear activation functions. These activations represent an unnormal- 
ized conditional probability vector for y given x. These give activation u, for i — 1 , . . C 
corresponding to 

u i = X) Pr (y I z - j) Fr ( z = j) n Pr ( Xi 1 2 = j) ■ 

j=l,...K .A 
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The first hidden layer has K nodes, each one corresponding to a hidden class, with ex- 
ponential activation functions. These activations represent an unnormalized conditional 
probability vector for z given x. These give activation uc+> for j = 1, . . . , K corresponding 
to 

u c+ j = Pr(z = j) JJ Pr(xi \ z - j) . 

I — 

The bottom layer is the input layer. The network form is represented in Figure 1.1. 



Figure 1.1: A cluster network 


Activation functions are as follows: 
Ui 

£i=l c U 3 

Ui = ^2 tl^c+jtlc+j 


for i = 1,...,C, 
for * = 1,...,C , 


i=i K 

ti C +j = “p 


( w c+j,o + 53 53 

indiscrete 

' £ \ l°g(2™?j )) \ for j = 1 , .. K . 

i€eontinno«« \ / / 


Notice the standard w notation for weights has been dropped in the case of and 
The weights have the following interpretation: 

Wi, c +j = Pr{y = i\z=j), 
t = log Pr(x, = h | z = j) , 
w c +j,o = log Pr(z = j). 
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In addition, fiij and <Tij represent the mean and standard deviation of the Gaussian on x,-. 
Notice also that the following constraints hold: 

13 w *’ c +i = 1 - 13 e “’ i+c, ° = 1 > 13 = 1 » 

•=1,...C isl f ..JC 

and t y,*,c+i are non-negative and ti>c+j,o are negative. These constraints can easily be 
enforced by restricting the changes made to these network weights during back-propagation. 
This can be done, for instance, using Lagrangian multipliers [LeC 88 ], or simply by insuring 
that one weight is determined by all the others according to the constraint. 

As an example of these network, suppose variables y and Xi, . . . , X 5 are boolean. Then the 
conjunct xi AX3AX5 is represented at the j-th node of the first layer by having Pr(x 1 | z = j), 
Pr(x 5 | z = j) and Pr(x 5 | z = j) close to 1 and making Pr(x 2 | z = j) and Pr(i 4 | z = j) 
neither near 0 or 1, say about 0.5. The activation Uj will then be near zero if an input pattern 
fails to match the conjunct and closer to 0.25 if the conjunct is matched. Several such 
conjuncts can be represented, and so expressions in disjunctive normal form (a disjunction 
of conjunctive terms) are approximately represented. 

1.4.3 Regression networks 

A statistical model that has a simple network representation is linear regression [HT90], 
where the mean of the response variable is usually given by 

y = 13 «'.•/<(*) . 

»= 

for “basis functions” /,■ and “parameters” or weights ttfj. A corresponding network has K 
hidden nodes without weights which compute the functions /*(x), leading into linear output 
nodes that compute the linear regression. The regression is called linear because it is a 
linear function of the weights t whereas the functions /, can be arbitrary. 

The various techniques such as smoothing, regularization, cross validation, etc., applied 
to these systems can then be cast in a network framework. MacKay presents a Bayesian 
model for this [Mac91]. The choice of basis functions depends on the behavior of y antici- 
pated. Some choices are products of sin n t Xi and cosn t x, for integer n,* when x, € [— t, x]; or 
combinations of Legendre polynomials or their integrals when x, € [ — 1, 1] and y is expected 
to be in the form of a polynomial; or combinations of Hermite functions when x, E [— 00 , 00 ] 
and y is expected to be in the form of a product of exp(— Yj£=i x i/^) some polynomial 
in x. Notice each of these choices may require rescaling the Xi’s initially. 

Of course in the more general case, we can use a semi-linear model 

y = 13 wo,ifi(x,wi j0 ) , 

which again has a straight forward network interpretation. 


1.5 Probabilistic Analysis 

This section covers each component of a Bayesian analysis in turn: interpreting networks 
as a probability function in Section 1.5.1; thereby calculating the likelihood of the training 
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sample for a given network and weights, in Section 1.5.2; considering priors for the weights 
in a network, in Section 1.5.3; and finally considering posteriors for the weights in a network, 
in Section 1.5.4. 

The notation E^|b (/(A, B)) denotes the expected (mean) value of the function /(A, J3) 
when A is distributed according to the probability function or probability density function 
Pr(A | B). If A is continuous, this is calculated with an integral 

Ea|* (f(A,B)) = J f(A,B)Pr(A \B)dA, 

and if A is discrete this is calculated as 

Ea\b (f(A, B)) = £ /( A, B) Pr(A \ B) . 

A 

For mixed continuous and discrete variables, combinations of these formula apply. Similarly, 
the notation (/(A, B )) denotes the variance of /(A, B ), usually calculated as 

va\b(HA,B)) = E AlB ^(f(A,B)--E AiB (f(A,B))) 2 ) . 

1.5,1 The network likelihood function 

To do Bayesian or likelihood analysis on feed-forward networks, we have to use the network 
output o(z, tu), a function of the inputs x and the weights tu, to construct a likelihood 
function l(y | x, w) for the observed pattern (x, y). This likelihood function gives the condi- 
tional probability distribution for the output variable y conditioned on the input variables x, 
given a particular network and weights. This likelihood function is the basis of all subsequent 
analysis. 

In the case of classification, the likelihood function is a probability distribution repre- 
senting the network’s estimate of the “true” conditional distribution for y given x. The C 
network outputs therefore give a conditional probability vector. The i-th output Oj(x,u;) 
represents the conditional probability that the discrete output variable y takes its i-th value. 
So 

l(y\x,w) = ° 9 (x,w) . 

In the case of regression, the likelihood function is a probability density function (it 
integrates over the domain of y to 1) of the form l(y | x, tu, F ) for the output variable y 
conditioned on x, the network weights and some other information F such as a standard 
deviation of error. The network output is usually defined to be the mean of the likelihood 
function given by 

o(x,tu) = / yl(y | z,w,F)dy . 

Jy 

The likelihood is usually defined in terms of some error model that is a function of y, the 
mean o(x, tu), and some other error parameters. An error model commonly used is the 
Gaussian distribution with mean o(z, iu) and standard deviation <r that is unknown (so has 
to be estimated along with the weights w). This means the true y is expected to vary about 
the mean o(x, tu) with standard deviation <x. It is more realistic that the standard deviation 
itself should be a function of x too. In this case, a second output node can be connected 
to the network to estimate the standard deviation; we denote its output by </(x,r), and 
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note that the weights r will have parameters in common with w. Vapnik [Vap82] has also 
suggested using the Laplace distribution as an error model when the experimental conditions 
may vary with maximal uncertainty but the standard deviation is still independent of x. 
This leads to the following choice of error models: with known/unknown standard deviation 

, , , x 1 ( ( y-o(x,w)) 2 \ 

'«< * l *.••') = 7S elp ( 2 ? — ) • 

with x dependent standard deviation o' (z, r) of 


Igd(v I x,w,r) 


1 

— — exp 

y/2x&(x ,r) 


( (y-o(x,w)) 2 \ 

V 2o'(*,r) 2 ) 


» 


and with experiment dependent standard deviation of 


h{y I x,w,A) 



(- 


|y-o(g,tp)| 

A 



1.5.2 The sample likelihood 

Given a training sample, we need to look at the likelihood of the sample as a function of the 
network parameters w. This assumes the weights w and any other parameters are known 
and gives the likelihood that the sample seen would occur. Bayesian learning equations are 
presented in the next section that show how this leads to learning of the weights. 

For a sample consisting of N examples given as a list of input vectors x = xi, . . .,xjv 
and corresponding output values y = yi, . . . , j/jv* the sample likelihood in the classification 
problem (assuming examples are independently and identically distributed) is given by 


I(y|*,w)= IJ o 9l (x it w ) . 

The sample likelihood in the regression problem with Gaussian error and standard deviation 
a is given by 


L(y | x,w,a) 


n 


i 

V^ir<T 


exp 


^ (y, - ojxj, w)) 2 ^ 



where 

» 2 = ^ X! (y< — o(xi,tv )) 2 

is the observed average squared error. The sample likelihood for the Laplacian error model 
is similar. 

Maximum likelihood methods for network training use the negative logarithm of the 
likelihood as a “cost” function (to be minimumized) [BW87, EJM90]. 


1.5.3 Prior probability of the weights 

In the classification case, prior probabilities over the network weights w give our a priori 
belief that o,*(x', i») for i = 1, . . . , C is the “true” distribution for y given x. In the regression 
case, prior probabilities over the network weights w give our a priori belief that o(x', w) is 
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the “true” mean of y given x. The choice of prior represents an assumption about what 
form w should take. By representing the prior as a probability distribution Pr(tu), this 
prior assumption is not expressed as hard constraints such as w njm = 1 or 0 < w n , m < 0.5, 
but rather as preferences among the many different possible values. 

There is considerable literature written about priors (see [Ber85] and references therein), 
and a number of methods exist for obtaining priors that are considered “non-informative” 
about the class of theories being considered. We suggest a few priors here, however believe 
more experience and research is needed for designing priors on neural networks. In general 
there is no “correct” prior, however, it is important to use a prior that has reasonable 
properties. See [Bun90] where a number of different priors are discussed for trees and 
[Mac91], where several priors are tried and compared for a single network learning problem. 

When dealing with probabilistic neural networks, as given in Section 1.4, we can often 
use the probabilistic interpretation of the network to guide us in the choice of prior. Some 
examples are given below. 

Priors on cluster networks 

For instance, the cluster networks of Section 1.4.2 have weights corresponding to Gaussian 
parameters or proportions. The parameters ti^c+j for i = 1,...,C are parameters for a 
multinomial distribution, so £*=1 c u> lt c+j = 1. A standard non-informative prior for 
this is [Ber85, Bun90] 

Pr(wi tC +j , . . . , t v c ,c+j) oc U wJJS} 1 , 

»= i c 

where we have ignored the normalizing constant so the prior is “oc” rather than “=”. 
Similarly, suitable priors for the other parameters can be found by using standard non- 
informative priors for the class of parameters concerned. Notice transformed parameters 
e tDc+j, o are multinomials, so we use the same form of multinomial prior as before for these. 
However we need a prior on weights not their exponential. This requires a change 

of variables, and note that twc+K,o is a function of ti>c+i,o» • • m ™c+K-i t o* Likewise for 

gWC+j.i.k # 

Pr(toc+i,o, • •• , tuc+Jf-i.o) oc exp j 1/K ^ w C +j,o ~ v>c+K,o 1 t 

\ i=l K ) 

Pr(w C +j,i,U • • • , w c+j,i,dt—l) OC exp I 1/di 5] W C+j,i* - w C+j,i,di I , 

\ ) 

Pr(Hi,j,<Ti,j) oc . 

Priors on regression networks 

For regression networks, priors say how we expect the mean of y given x to change with 
x. A standard approach is to say we expect y to vary smoothly as the input variables 
change. For instance, for linear regression networks, one can say the average 
second derivative of y w.r.t. x should not be too large. This section argues that a suitable 
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prior on the weights takes the form 

Pr(w) oc C(w)~ K f 2 


for a quadratic function C(ii;) on the weights w determined by the basis functions used 
in the regression. This is equivalent to but simplifies the approach given in [Mac91] for 
corresponding priors. 

With x defined over the multidimensional space X , the average magnitude of the second 
derivative can be expressed as a function C(w) given by 





where the norm || • || is an average measure of the size of the second derivative at a single 
point. We assume the space X is finite but relax this condition later on. One way to 
calculate the norm || • || at the point x is to calculate the average change in y about a ball 
in x-space of fixed radius from x (i.e. x 4* Ax where | Ax| = 1) due to the second derivative. 
Using a second order expansion, this is given as: 


d*y 

dx2 





iAxj 



dAx 



for A=l, and for large A 


1 

4A 2 



cPy&y „ A ( d 2 y 

dx iM ,“' 1 \dzidxj) J 


We use the approximation below since it is proportional to the exact calculation when >1 = 1. 
Substituting into C(w) and using the regression model y = £ #ssl K ttf,/i(x), we then get 


where 


K 

C(tv) = ^ w m w„C m ,„ , 

m,n = l 


J_ A f ( PfM&Mx) , o a 2 / m (x)a 2 / n(x) \ 

4A 2 ,^/rex \ dx* dxtdxj dxtdxj J 


We can then put a prior on w in the general form 

Pr{w | a) a exp , 

where a is a prior “hyper-parameter” to be determined. A large value of a means we have 
a strong bias towards smooth functions, and a small value means otherwise. We leave a to 
be an undetermined parameter because we cannot be sure a prior how a particular value of 
a will effect smoothness. MacKay shows how to determine a a posterior , however we show 
below how to marginalize it out. 
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This gives a prior in the form of a multivariate Gaussian 

a K f 2 det 1 ^ 2 C 


Pr(w | a) = 


(2r) 


K/2 


BXP C ’( W )) ’ 


where JRT = |t^| is the dimensionality of the weight set w. Since we are not really sure what 
a G [0, oo) should be anyway, we can marginalize it out. Since a appears as a factor of 
C( to), it is a scaling quantity so a suitable prior is [Ber85] 


Pr(a) a 1/a 


This gives 


Pr(to) = f Pr ( w | a)Pr(a)da 
Jo 


oc 


t k I 2 C(w) k I 2 v 


During gradient descent, the important contribution is 


dC(w) 


d - logPrjw) dlogC7(tfl) _K_ 

dwi ' dwi 2C(w) dwi 

If we compare this with the derivative of the original Pr(u/|a), then it follows that 

K 

where w is the weights actually used. Since C(w) is the measure of smoothness (where 
smaller is smoother), this says that one should change weights to increase smoothness of the 
mean function y. Notice from the definition of C{w) that the partial derivatives are 

calculated when calculating C(u>). Likewise for second derivatives. We therefore get that 
all first and second derivatives of —logPr(w) w.r.t. w are calculated with the same cost it 
takes to calculate C(w). 

Values of C m , n for a given set of basis functions only have to calculated once, but 
usually lead to a complex matrix. In this case, a diagonal approximation to the matrix may 
be sufficient. For instance, when using basis functions in the form 

n sin riiXi JJ cosniX,* 


for different n, and index sets /„ , then the diagonal term is given by 

2 

/ — > 

Cn^n CX 


(£■<) ■ 


Entropy priors 

Zellner’s maximal data information prior for a probability function Pr(z \ 0) on parameters 
0 [Zel90] takes the form 


Pr{0) cx exp (-7 x | tf (0)) , 
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where 

L\e(9) = ~ j Pr(z\0)logPr(z\0)dz 

is the usual entropy function giving the entropy for z distributed as Pr(z\0), which is a 
function of 9 . 

When doing classification, we are constructing a network for the conditional distribution 
Pr(y \x,w). Assuming a distribution for x is known, a prior using the above method can 
then be formed as follows. 

Pr(w) a exp (- ( I x + E, («>)))) 

oc exp(-E, (I,|,,.(u>))) , 

« <*p ^ • 

This assumes the entropy of x, I x is independent of w . Since the distribution for x is not 
actually known, the last line estimates the expectation E x (•) with the empirical average 
from the sample of input vectors x = xi, . . ., x^. Since we are forming a prior on weights 
(parameters) for a conditional distribution of y given x, which presumably is independent 
of the distribution for x, we can use the sample of points for x to estimate a distribution 
for x. For the classification case, 


c 

= ~ X) W ) l°g °i( X ^ U>) • 

*=1 

For the regression case with x dependent standard deviation of </(x,r), 
Iy\*4,v>A w y r ) - log o'(s<i,r) + constant . 


Weight elimination and priors 

One particular prior we have worked with assume the weights have a prior probability that 
corresponds to 

1^2 j ||| 2 

— log Pr(w) = AC(u;) 4- constant = A *’"1 — 7-5 + constant . (1.2) 

The unspecified constant term is required to ensure Pr(w) normalizes to 1, but its value 
is unimportant since we primarily deal with derivatives. This term was first proposed by 
Rumelhart in 1987. wq is the scale for the weights: For |tyf, m | ^ the cost of a weight 
approaches unity (times A), allowing the interpretation of the complexity term as a counter 
of significantly sized weights. For |t wi tfn | the cost is close to zero. 

Relevant weights are drawn from a uniform distribution (to allow for normalization of 
the probability, up to a certain maximum size). Weights that are merely the result of a noise” 
are drawn from a Gaussian- like distribution centered on zero; they are expected to be small. 
The corresponding bump around zero can be approximated by a Gaussian with variance 
a 2 = ti/j)/A. The larger A is, the closer to zero a weight must be to have a reasonable 
probability of being a member of the “noise” distribution. 
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The learning rule is then to change the weights according to the gradient of the entire 
cost function, continuously doing justice to the trade-off between error and complexity. This 
differs from methods that consider a set of fixed models, estimate the parameters for each 
of them, and then compare between the models by considering the number of parameters. 
Taking a complexity term into account during learning required A to be adjusted dynamically 
during learning. The details of this procedure are described in [WRH91]. 


Bayesian interpretation of weight elimination 

To place this approach in a Bayesian framework, consider the case where we place a prior 
on A, Pr(A) to determine the overall prior on w . We do this as follows. We say 

Pr(w | A) a c- AC <»> , 


where C(ti/) is the sum given in Equation (1.2) and notice that this distribution is improper 
because it cannot be normalized (C( w) is bounded above in magnitude by the number of 
weights |tu|). If we restrict the weights w to remain inside a sufficiently large finite region 
W in tu-space of area |Wj, then we get 


Pr(w | A) « 

Pr(tu) « 

d — log Pr( w) _ 
dw njTn “ 

d 2 — log Pr(ttf) _ 
dw ntfn dwi yk 


_L e -A(C(»)-M) 

\w\ e 

A) e- A < c (»>-H>dA , 

dC{w) 

' dw n ,m 

&*Cjw) , dC(w)dC(w) 
A| “ dw n>m dw , , fc dw n>m dwt t k 


where nx\ w is the expected value of A given weights w and <r\\ w is its variance. Notice these 
derivatives are essentially the same as the derivatives of the original form in Equation (1.2) 
except that A is now set automatically. The terms are given by 


/ A Ap(A)e“MC(«)-k|) dA 

^ ~ J A p(A)e-*<c(«)-M)dA ’ 

<J x\'w is defined similarly. Reasonable values appear to be of the form 




1 

C(w) 


which means A is expected to vary inversely with the number of small weights. 


1.5.4 Posterior analysis 

To do Bayesian analysis, we need to be able to determine relevant posteriors from priors and 
likelihoods. For the regression case with Gaussian error and unknown standard deviation a 
this is as follows. The product rule for probabilities gives the following: 


L(y | x,w,a)Pr(w,<r)Pr(x \ w,<r) = Pr(iy,tr | *, y)Pr(y | *)Pr(sc) 
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where L is the sample likelihood as before, Pr( w,<r | x,y) equals the posterior probability 
of the network weights w and the parameter to the error model a. Pr(x | tu,<r) is the 
probability of seeing the inputs * given w and a. This can be assumed independent of w 
and a since they just parameterize the model for y given x, so Pr(x | t/>, <r) = Pr(x) which 
is the prior probability of seeing the set of inputs x. Pr(x) then cancels out both sides. 
While Pr(y | x) also has a particular meaning, for our purposes it is a normalizing constant 
because it is independent of w and a. Re-expressing this, and assuming that w and <r are a 
priori independent, we get a standard Bayesian learning equation (see, for instance [Bun90]) 


Pr(w,a\ x,y) 


L(y j x, tu, (r)Pr(w)Pr(<r ) 
J Wftr L (y I x,w,<r)Pr(w)Pr(<r)dwd<T * 


When using the network, the weights w are the primary concern and a is of secondary inter- 
est, so we would also like to know Pr(w \ x,y), which can be calculated from f 9 Pr{w y <r \ 
y)d<T. If we assume the standard non-informative prior for the standard error cr, Pr(cr) = 
l/<r [Ber85], the integration for o given w can be done in closed form using the T integral 
[Ber85] (with a change of variables r = 1 /<t 2 ). This yields 


Pr(w | x,y) 


L(y | x,w)Pr(w) 
L L (V I *,w)Pr(w)dw ’ 


where 


L(y\x,w) 


r(^) 1 


Furthermore, the expected value of <r 2 given w and z, y can be calculated using the same 
integral, giving a standard result in statistics: 




N 


N-l 


(1.3) 


A similar calculation can be done for the Laplacian error model with unknown standard 
deviation A, and this time 


(A 1 ) = n •£ l» - 

To summarize, posterior probabilities for a set of network weights w are as follows. 

Lemma 1.5.1 In the classification and regression frameworks described above for neural 
networks , assume the input x is a priori independent of the network weights w and other 
parameters such as a and A, that is Pr(x\w,<r) = Fr(x). Posterior probabilities of network 
weights are as follows. For regression with Gaussian error and unknown cr, 

Pt(w | x,y) a Pr(tw,y|z) = Pr(w) EilxJ — , 

(JVt )*r-y/N (S, =1 N (yi -o(z,-,u))) 2 )~ T ’ 

for regression with Laplacian error and unknown A, 

w 

2^(E <= 1 ’ 



Pr(w\x,y) oc Pr(w,y | z) = Pr(to) 
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and for classification 

Pr{w\x,y) oc Pr(ti>,y|at) = Pr(w) JJ o 9 i (xi,w) . 

i = 1 If 

These posterior probabilities define the Bayesian solution to the problem of training net- 
works. With a Gaussian error model, high posterior weights w trade-off the prior Pr(iu) 
and the mean-square error. With a Laplacian error model, the trade-off is with the average 
absolute deviation. In all cases, as the sample size N gets large, the prior term will become 
negilible and can be assumed constant. 


1.6 Analyzing weights 

The section describes how we can use the posterior analysis just given to assist basic tasks 
done during network training. In Section 1.6.1 we describes how we derive “cost functions” 
for a given set of weights from the posterior formulae and Section 1.6.2 describes how we 
can evaluate a set of weights. The cost function allows us to find a locally optimum set 
of weights and the evaluation allows us to compare the quality of one local optimum with 
smother. 

1.6.1 Cost functions 

The posterior probabilities just given represent functions that should be maximized. Cal- 
culation is usually done in logarithms to prevent arithmetic underflow and to turn products 
into sums. Hence when searching for a high posterior set of weights we would like to mini- 
mize the “cost function” - log Pr(w \ x, y). Notice that 

-log Pr(w\x y y) - - log Pr(tu,y | x) + log Pr(y \ x) , 

and since the term log Pr(y \ x) is constant in w and difficult to calculate in general, we 
can instead minimize the cost function 

Co8t(u;) = - log Pr(w,y | sc) , 

which can be directly computed using Lemma 1.5.1. 

These cost functions differ from maximum likelihood methods for network training 
[BW87, EJM90] in that they introduce a prior term and sometimes have “nuisance” pa- 
rameters eliminated. For instance, <r is termed a “nuisance” parameter when trying to 
determine a good set of weights and have no concern for the typical error. The prior term 
corresponds to the regularizing term found in smoothing methods [HT90] and used to good 
success in techniques such as weight-elimination [WRH91]. A similar Bayesian formulation 
is given in [Mac91]. 

Using the prior discussed in the previous section, with maximum a posterior analysis we 
would minimize in the regression case with Gaussian error and unknown a 

- log Pr(ti7, y | *) 

= N 1 log ^2 (yi - o(xi , iu)) 2 - log Pr(w) + constant , (1.4) 

t=l N 
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where the logarithm of the prior is determined as before. If the value of a is known the cost 
to be minimized is 

Yi X] “ °( x *» w )) 2 - lo 8 Pr ( w ) + constant . 

Notice the difference between these above two cost functions. In the second case, the cost 
function is proportional to the mean squared error plus a “regularizing” term, but in the 
first case the logarithm of the mean squared error is taken. If the value of a is determined 
from the inputs as well, as c/(®j,r), the cost to be minimized is now — log Pr(tu, r, y | sc) 
given by 

^2 Orff- — Tj(w -o(*„n>)) 2 + ^2 log o'(x,,r)- log Pr(r,tw) + constant . 

Minimization has to be done for r and w concurrently. A prior for r was given at the end 
of Section 1.5.3. 

In the classification case, the overall cost is given by 

— logPr(iy,y | sc) = - ^ log o y .(xj, iw) — log Pr(iy) -f constant . (1.5) 

•si,..., N 

If the output variable is boolean and represented as 1 for true and 0 otherwise, then the 
first sum expands to 

- ^2 (jKlogOlfcn 11 ') + (1 - Vi) log(l - Oi(x<, U>))) . 

Notice this is the familiar cross entropy error for the sample. 

While this gives the cost function to minimize, we do not consider here the computational 
problem of how the minimization should be done. Various gradient descent, conjugate 
gradient or approximate second-order methods [BL88, EJM90] can be tried to solve the 
minimization problem. See, for instance, [PFTV88, Chap. 14] for a tutorial discussion of 
methods for solving the minimization problems for linear and non-linear regression with 
Gaussian and other more robust error models. 

Notice the formulation given here assumes learning is done in batch (or epoch) mode 
because the posterior needs to be calculated on the full sample. When initializing search, 
it may help to use smaller batches to get somewhere near a local minimum, and perhaps 
do batch learning on the full sample when refining an already reasonable solution. This 
corresponds to the stochastic gradient descent strategy that is common in back-propagation 
implementations. We do not consider which strategy is better here. 

1.6.2 Weight evaluation 

To obtain a measure of the quality of each local maximum a posterior estimate w found 
during search, an estimate of the local area under the posterior around w is usually done. 
The actual posterior value itself is not the best measure of quality because some peaks 
may be thinner than others so they contain much less of the posterior probability in their 
vicinity. For instance, consider an idealized learning problem where the scalar parameter a 
is being learnt. Suppose a has posterior given in Figure 1.2. Notice the left peak is higher 
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but much thinner. The expected value of o, E* (a), and other functions of a would therefore 
come more from the fatter peak on the right. This becomes more pronounced in higher 
dimensions, or when comparing networks of different dimension. 

This local area estimate of quality lets different local maximum a posterior estimates be 
compared on an equal footing, for instance networks with different numbers of layers or con- 
nections. This local area estimate and successively coarser approximations to it corresponds 
to the various encoding measures such as MML [WF87] and MDL [Ris87], as discussed in 
the next section. 

To make this estimate, we approximate the posterior at the local maximum w by a 
function of the form 

Pr(w | *,y) 

* Pr(w | *, y) exp (~(ti> - w) T I(w)(w -w) + 0((w - tS) 3 )) , (1.6) 

where 


I(w) = d 2 log Pr(w I x, y) _ d* log Pr(w, y | as) 

dwdw dwdw * 

— d a Coet(tp) 
dwdw 

Notice the differentials are du; rather than dw because the derivative represents the full 
matrix of second derivatives w.r.t. the different weights ti; nfm . Roughly, this approximation 
is valid for large sample size N because the posterior is found by normalizing for w a formula 
of the form g(w) N (see for instance [Ber85, p224]). The second derivative I(w) is taken of 
the cost function, the log posteriors such as in Equation (1.4) or (1.5). Generally, the sample 
size N should be at least some factor of the number of weights in the network. 

When using a uniform prior, J( w) is referred to as the observed Fisher information ma- 
trix and its deter minan t is referred to as the Fisher information. Both play a central part in 
statistical theory [Ame85, Ber85]. In some simple non-network models, such as distributions 
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from the exponential family [Ber85], class probability trees and Bayesian networks on dis- 
crete variables [Bun90], the posterior can be dealt with exactly so the approximation using 
J(tw) is not necessary. Because the constants in Equations (1.4) and (1.5) are independent 
of w, they can be ignored when evaluating the derivative. The local posterior area for w 
near w is found by integrating out w using the multivariate Gaussian approximation, and 
is given by 

(2x)l“l/ 2 

Pr(near w\x,y) » Pr(w \ x, y) ——^ , 

where |ty| is the dimensionality of the variable set w and det(*) is the matrix determinant. 

The desired quality measure is now given by the negative logarithm of the local area 
(and adding in the constant — log Pr(y \ *)) 

Eval(S) = — log Pr(near w,y \ x) 

= - log Pr(y I w, x) - log Pr(w) - iy log(2x) + ^ log det(I(w)) , (1.7) 

which wants to be minimized. Notice the first and second terms together give Cost(u;). We 
separate them out here because they correspond to the likelihood and prior components 
respectively. 

Notice also, that from the functional form of Equation (1.6), the matrix inverse of 
I(w), I(w)~ 1 y represents an approximate posterior variance-covariance matrix for [Ber85, 
p224], and is at least approximately correct in the neighborhood of w . This means, for in- 
stance, that the posterior variance of the weight w n , m in the neighborhood of u?, denoted 
v %p|x y near w ( w "» m)» is given by the diagonal entry for u> n?ro in the matrix /(u;)” 1 . That is, 
if v^j x near is large, then the weight ty n ,m is poorly determined by the data and 

perhaps should be ignored. 

The determinant det(I(ii;)) could be approximated by looking only at the diagonals or 
block diagonals (block corresponds to one node which is square in the number of weights 
for the node). Individual second derivatives can be determined using a number of methods 
[BW91b]. Fortunately, this only has to be done once for each local minima w found, in- 
stead of repeatedly during search, so a fast approximation is not essential. Fast packages 
exist, such as the Fortran MATLIB, that can find determinants and inverses of matrices of 
row /column size several hundred in a few minutes. 

1.6.3 Minimum encoding methods 

Quality measures similar to Eval(tiJ) can be derived using various quantisation, encoding 
and approximation methods. We discuss these briefly here to draw the strong connections 
between Eval(tS) and the often discussed encoding measures given in [WF87, BC91, Ris87]. 
We view these methods as potentially useful approximations but do not consider their 
application here. 


Rissanen’s MDL 

Rissanen develops his MDL measure [Ris87] as an upper bound on - log Pr(w,y | *). For 
exposition purposes, we develop a related bound here. I(w) is a positive indefinite symmetric 
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matrix (since w is a local maxima), so 1 


detl(to) < 


/ trace/(to)\ ^ 

\~\ ) 


where traceA is the sum of the diagonal entries in the matrix A . Now for large JV, 
traceI(w)/N is a mean so is approximately Gaussian, with mean 0(|to|) and standard 
deviation 0(\w\/>/N). Therefore with probability approaching 1 as N gets larger, 


^logdetJ(w) = logAT + 0(|w|) . 

We get that 


Eval(tS) = - log Pr(y | w, x) - log Pr(w) + log N + 0(|ti»|) . 

A 

This is an approximation developed by Schwarz and others [BC91]. The corresponding MDL 
form ignores the term log Pr(to) which is probably 0(|to|) anyway and has an additional 
term Li log |io|. In our case, 0(|to|) is quite large because we are dealing with networks with 
many weights. It is not unusual in practice for the number of weights to be a similar order 
of magnitude to sample size W, so ignoring terms of order 0(|to[), as this approximation 
does, is probably unwise. We conclude this approximation is informative but perhaps too 
crude. 

Wallace et aL and Barron et aL * s measure 

Wallace and Freeman [WF87] and Barron and Cover [BC91] interpret a form like Eval(to) as 
the cost of encoding y and to given x. Wallace et aL refer to this as the minimum message 
length (MML). Of course, y and to can only be encoded to finite precision (in classification, 
y is finite already). The precision of y is implicit in the supplied data, so if the volume of 
the precision for the data vector y is S then the cost of encoding y given to and x is 

-log Pr(y | 55, *) -log£ . 

Because y is encoded given to, we must encode to without knowledge of y. The second, third 
and fourth terms of Equation (1.7) are not an appropriate code length, however, because 
I(w) is dependent on y. 

^The third and fourth terms of Equation (1.7) essentially correspond to the “precision” 
of to because if the continuous space for to is quantified then the cell in the neighborhood of 
to has prior probability given approximately by Pr(to)Area(to) where Area(to) is the area of 
the cell, so — log Area(to) is the precision to which to is specified. To interpret this precision 
component, notice that if the weights to happened to be a posterior independent (an unlikely 
event), then 

.-i|!l°g(2x) + llogdet(/(«»)) « -J> (2.51 • S- (w»,m)) • 

»,m 


* The dete rminan t i» a product of eigenvalues, whereas the trace is a sum of eigenvalues, which in this 
case are all non-negative. The inequality follows by maximizing A,- given a fixed value for ^ . A;. 
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The recommended precision for encoding each weight is therefore approximately 2.5 times 
the posterior standard deviation of the weight. 

Wallace et al and Barron et al make the fourth term in Equation (1.7) independent of 
y using an involved quantification argument. We can interpret this by noting 

^ log det (/(t?)) = ^logJ\r + ilogdet^I(tw)j . 

The term jjl(w) is now in the form of an average since in general I(w) is a sum across the 
N points in the sample plus a fixed prior term. We now replace this “average” observed 
Fisher information matrix by what is called the expected Fisher information matrix: 

Coding(y, w) = - log Pr(y \ w, as) - log 6 - log Pr(w) - -!y log(2x) - -!y 

+-y + 1 logdet(7(ui)) , 

where I(w) is the expected Fisher information matrix (the expected value of the observed 
Fisher information for a single pattern) given by 2 


I(w) 


E 

E, 


( d 2 log Pr(x,y\w)\ 

V d^L ) ’ 

(«*- ■ 

d 2 log Pr(y | u>,i<) 


jj E w(- 


dt^dty 



where (x, y) in these formula denotes a single pattern rather than the training sample (x, y). 
Notice that this last formula is an estimate of jjl(w) involving knowledge of the sample 
inputs x but not the sample outputs y. It is therefore a valid inclusion in the code-length 
of U). 


1.7 Applications to network training 

This section describes some applications of the tools just developed. 

1.7*1 Weight variance and elimination 

The Gaussian approximation to the likelihood near a local maximum a posterior also gives 
a statistically sound method for pruning of networks [LDS90]. The matrix inverse of /(uJ), 
J(tS)” 1 , represents an approximate posterior variance-covariance matrix for tu, so we can 
use this to test if a weight is significantly different from zero. Those that are not can be set 
to zero. 

3 There is actually some slight of hand here. Wallace ct a l and Barron et al. develop their framework to 
deal with likelihoods of the form p(x|ti/) for data x and parameters w whereas we are considering conditional 
likelihoods such as p(y|w,x). Here we give our interpretation of their methods in this different context. 
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If we use the diagonal approximation to this matrix, then we get an estimate of the 
posterior variance of the weight tu n>m in the neighborhood of w 


|x,y (u>n,m) « V «|x,y,„eor W _1 j 


d log Pr(w | a?,y) 


dw n<m dw n ,„ 


Because ti; is a local maximum for the posterior, the second derivative must be non-positive 
so the variance estimate non-negative. A zero variance estimate means some other estimate 
will have to be made. A large variance means the sample gave us little idea as to what 
value the weight should be. So if the magnitude of a weight is within a standard deviation 
(square root of the variance), the weight can be safely set to zero, which is similar to the 
suggestion in [Ish90]. If the standard deviation of a weight is small, and the weight itself 
w njm is smaller than the standard deviation, then the weight can be safely set to zero. 
Alternatively, if the weight’s standard deviation is large but the magnitude of the weight 
larger, then the variance gives no support for zeroing the weight. 

A more thorough test goes as follows. The quadratic term (w — w) T I(w)(w — w) appears 
in the multivariate Gaussian approximation for the posterior for the weights w. For large 
JV, we therefore get that this quadratic term is approximately x-squared with |iy| degrees 
of freedom. The X% confidence region (for instance, X = 99.0) for the weights w is the set 
of weights within the upper X% point of the X\ w \ distribution, xf w \,x> 


w : (ti> — w) T I(w)(w — w) < xf»|,x • 

For instance, the upper 99% point for 30 weights is given from standard x-sq uare d tables 
as 50.89. We therefore proceed as follows. We repeatedly set weights to zero while the 
resultant weights remain in the X% confidence region. Notice this simple test can be done 
directly from J(ty) without having to first calculate a determinant or inverse, so might also 
be done at stages during the back-propagation cycle. 


1.7.2 Prediction and generalization error 

Once search has located some local minima w to the cost function, they can be used in 
inference on new patterns. A naive approximation is to say that w must now be w and to 
make inference about y and <r 2 , etc., based entirely on this local minima w using y = o(x', £5), 
etc. The reason this is naive is that in reality we cannot be sure that the “true” or “optimum” 
w is w. For instance, with a sample twice the size we might change our estimate of w quite 
dramatically. A full Bayesian approach says to take into account the many other values of 
w near w or even some other local minima because these other values just might happen 
to be the “true” one. Only when we have a really large sample (for instance, when uniform 
convergence bounds tell us the sample is large enough [Hau91]) can we be sure that w is a 
sufficiently good estimate so that no other need be considered. 

In principle what we would like to calculate for a new pattern x' is E w | Xty (o(x', tu)) and 
E w ,<r[x,y ( <7 ’ 2 ) » e ^ c - These values give us the posterior average of the quantities o(x', it;) and <x 2 
that give better predictions for these quantities on average than those calculated for w ~ w. 
The problem here is that we cannot calculate the posterior Pr(iu | ac,y) in reasonable time, 
and only have the Gaussian approximation to the posterior described above. 

This section describes how these predictions can be approximated using the posterior 
approximations available. These approximations all make use of the second derivatives of 
the cost function and the approximate weight variance discussed above. While it is hard to 
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predict how accurate these approximations will be in practice, their form at least gives some 
idea of the interaction between weight variances and the behavior of the network output for 
w in the vicinity of w. 

Predictions are given for three quantities. The first is an approximation to E w j x>y (o(x', to)). 
In classification this can be used to get an estimate of the posterior distribution for y given 
x\ and can be used to predict generalization error, and in regression this gives an estimate 
of the posterior mean of y. This can also be used to give generalization error for classifica- 
tion. The second prediction is for v w | x>y (o(x', tu)) which measures our uncertainty in the 
first prediction. Notice as the sample size gets large this uncertainty will shrink to zero. We 
give two versions of this uncertainty depending on which approximation to E w j Xjy ( [o(x’,w >)) 
is used. The third prediction is for E W| ,| X|7 (a 2 ) which predicts generalization error for 
regression with unknown standard deviation. 


Predictions for a local minima 

In what follows, we use the notation E w | x near g(U(w)) to denote the expectation of the 

quantity U(w) when averaged according to the approximate Gaussian posterior for w in the 
vicinity of w. That is 


E «.|x,y,»ear .W")) = f* U ( W ) CXP “ w) T I(w)(w - w)j dw . 

We evaluate these integrals by using standard moments of the multivariate Gaussian [Ber85], 
and approximating U(w) in the vicinity of w using the second-order Taylor expansion which 
in vector notation is as follows: 


U(w) M U(w) + (w- w) T ■ 

dw 

+ - (w - w) T • -■ £( —) 
2 V ’ dwdw 


(w-w) . 


Again, denotes a vector of first derivatives and the second derivative a matrix. This 

gives 


»|x,y , near w 


~(tT( W )) « U(w) + ± trace (/(w)" 1 . J , 

' \wzzws 


where, again, traced denotes the sum of the diagonal entries of the matrix A. If any of the 
approximate weight variances are large, then the approximation of E w |^ y near ~(J7(™)) will 
require accurate estimates of U(xv) for w far from w. In this case the Taylor expansion will 
be a poor approximation and our approximate predictions will be poor. Notice we could 
also use the diagonal approximation for the matrices J(t£) and the second derivatives of 
U(w), however, the estimates may then become very poor and only be useful to indicate 
where potential problems lie. These and other approximations are discussed in [Pre89]. 


®w|x t y,near w (°i x ^ w )) : posterior expected output of the network given input x', av- 

eraged over weights in the vicinity of w. This is approximated by 


o{x',w) + ^ trace 



d 2 o(x\w) 

dwdw 
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This modifies the output o(x', w) to account for how the output o(x', u;) varies in the 
neighborhood of zfi, tempered by the posterior variance-covariance of w. In the case of 
classification, this gives an estimate of out-of-sample accuracy for the network given 
by 


1 J \ 

(**(*<• ")) 


• ( 1 . 8 ) 


w ti; ) ” °( x/ ’^)) 2 ) : w ^ en using o(x', tD) to estimate o(x',u>), the ex- 

pected variance of the output in the neighborhood of u), which gives a measure of 
variance for our posterior uncertainty when using o(x', t/5) to estimate o(x', w ). For 
instance, in the regression case o(x', w) corresponds to an estimate of the mean value 
of y conditioned on x'. The variance is approximated by 


do(x', w) 
dw 



do(x',u;) 

dw 


(1.9) 


»|x, 7 ,near » W 1 '^)) : the variance of the estimate E w|Xty ne<jr -(«(*', u>)) in the neigh- 
borhood of w. This is approximated by 


fo^™) T . . d °(^. n>) 

du; y } dw 



d 2 o(x', t/j) 
dtz;du> 



Notice this is lower than the previous variance estimate of Formula (1.9) due to the 
second term which corrects for the fact that we are now using a better estimate for 
o(x', tu). 


E «.,<r|x,y,neor » ( a2 ) : the posterior expected value of a 2 in the regression case with Gaus- 

sian error and unknown <7. This is sometimes termed the generalization error. This is 
approximated using Equation (1.3) by 


JTn £ ( E »|x,y, B ear^(°( X *’ W ))-V .) 2 + T^T £ V „|x,y,» e «r » W*'* 1 «)) » 

where the variances in the second sum were just given. This simplifies to 


N 


s 2 \ ~ + 


N 


N- 1 '•«* 2(N — 1) 


trace \ 


(w-«L) • 


Notice the first term is the observed variance, and the second term increases this 
because the estimate E v | neor ^(o(x f *,ty)) is only approximate and has a variance 
of its own. 
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Multiple networks 

If multiple local minima tJ of similar quality have been found during restarts in the search 
for weights, then the expected values above can be averaged over the different local minima 
w to produce a pooled estimate. This same “multiple models” approach can produce sig- 
nificant improvement in out-of-sample prediction when learning classification trees [Bun90]. 
It corresponds to a Monte Carlo estimation of the full posterior for tu, rather than just an 
approximation about w. Suppose a small set W of local minima are found, and for each 
we have a proportion p(w | x, y) to be used in the Monte Carlo estimation below, where 
| *?!/) = 1. For instance, these proportions could be constructed proportional 
to Pr(near tJJ, y | *), although equal proportions are sometimes used in Monte Carlo es- 
timation. The estimate of the posterior for the weights w is now a weighted some of the 
Gaussians in the neighborhood of each w: 

Pr(w | *, y) « p(w | x, y)Pr(w | x, y, near w), 

where the Gaussian approximation described at the beginning of this section is used for each 

Pr(w | x^y^near w). 

This leads to the following corrections for the previous predictions. To find the expected 
output of the network for different weights, we pool the expected outputs for the individual 
local minima in W: 

E w |x,y (o(*\ w)) « Y, Pi™ I •» v)E w|x y neor ~ (o(*', w)) . 

O'*. 

w£W 

To find the variance of this value, we pool the variances for individual local minima, together 
with a measure of how much the output varies from minima to minima; 

V.|x,y (o(x\ W)) « Y Pi™ I V) V ,|x,y,ne.r » (°( X '' W )) 

w£W 

+ v*^. (e . u;))l 

«o|x,y \ ttlx^near w \ \ » /// 

To estimate the standard error <r in regression with a Gaussian error model, we pool the 
individual estimates of a. 


Ew,<r|x,y (<T 2 ) 


E p(ul|x,y)E , *(<r 2 ) • 

1 Wj4r\ x^y^near to \ ) 

w£W 


1.7.3 Adjustments for missing values 

In many practical problems, training patterns are supplied with some values of variables 
missing, unknown or undetermined. Of course if a pattern has its output value missing the 
pattern can just be ignored. If an input variable has a value missing, however, we need 
to deal with it. Approximate methods for dealing with this kind of problem exist when 
learning tree classifiers [Qui89]: 

• One can ignore the training pattern with missing values. Stochastic learning can 
proceed initially using only the patterns without missing values. 
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• One can replace the missing value by some simple estimate such as the modal or mean 
value. For instance, if the boolean input variable x,- is either 0 (false) or 1 (true) and 
if it is unknown, set it to be about 0.5. 

• One can complete the pattern in various ways (to fill in missing values) and treat each 
of these completions as a partial pattern (so the sum of fractional patterns adds to 
1). We explain this idea in the context of feed-forward networks below. Tree learning 
algorithms do this completion in an efficient demand driven manner (i.e. they complete 
a missing value only when the value is asked for by the algorithm) that unfortunately 
is not available with feed-forward networks. 

The second approach works quite well and is the simplest to implement, however the third 
approach gives the best performance in terms of out-of-sample prediction. 

In this section we develop a modified algorithm for handling missing values related to the 
third approach above. It also approximates the Bayesian normative approach for handling 
missing values because we derive the approach here using standard laws of probability. 
The method is another example of the use of mixture models [JJNH91] used for adaptive 
mixtures of local experts, and the equations derived have a related form. We do the analysis 
for the regression problem with Gaussian error and unknown standard deviation. The other 
cases are similar. We assume we have a model of some kind denoted F that gives rough 
prediction of the missing values from the known values. The model F might correspond to 
simple linear models or some other form readily calculated from the data. 

The training sample consists of input vectors xi, . . and corresponding output val- 
ues But in this case some of the input vectors x, have missing values. Let 

unknown(xi) denote the set of variables in the i-th input vector X* that have missing val- 
ues. For instance, given the i-th boolean pattern x< = (1, 0,?, 0, ?, ?) where ? denotes a 
missing value, then unknown(xi) is the 3rd, 5th and 6th variables. Given an assignment to 
these, x' E unknown(xi) y let x t V denote one possible completion for the input vector x,-. 
Then the likelihood of the pattern is now given by 

Kvi I X iy W y a y F) - Ex*eunknown(xi)\x h F(Ky I X%x' ,W,< t)) 

where the expectation is done using the model F for predicting the unknown values. We 
can approximate this stochastically by selecting a few possible completions compZ(x;) and 
weighting them to give 

KVi I x t ,w,<T,F) « ^ p(x'\xi,F)l{y\xix\w,<T), (1.11) 

x # €comp/(xi) 

where the weights p(x' | X*, F) indicate likelihood of the different completions based on the 
model F . This formula can now be evaluated because each likelihood l(y \ x<x',it;,cr) can 
be determined as in Section 1.5.1. For instance, in the example above, we may select 3 
completions according to the model for predicting missing values as 

compl(xi) = { (1, 0, 0, 0, 0, 1), (1, 0, 0, 0, 1, 1), (1, 0, 1, 0, 1, 0) } 

and give them equal weighting of 1/3 each. Completions and weightings can be determined 
once before the network training begins. Notice the crudest estimate is to use only the single 
modal or mean value for the missing values with a weight of 1, as is often done. 

Network training now consists of using these modified likelihoods. In the regression case 
we are considering, the standard deviation can no longer be marginalized out so the cost 
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function is now — log Pr(tu, cr, y|F, se). Suppose the patterns in missing have missing values 
and those in complete have all values given, then the cost is: 

E 2^2 (2/1 — °( x iy w )) 2 + (JV + 1) logo- — log Pr(w) -f- constant 

incomplete 

- 53 lo s ( 53 

«6miuinj yj'6comp/(n) ' 

While this looks more involved than the original cost function of Equation (1.4), its deriva- 
tives are not that different. 



3Cost(w) 

dw nifn 


d log Pr(iy) 
dw nim 


-5. £ 


(y. -o(xi,w)) 


incomplete 


do(x,, w) 
dw n<m 


1 

2<r 2 


53 53 I *<> *"> <r ) • (y. - v>)) 

inmisting x*ncompl(xi) 


dofax'y w) 

dw„, m 


( 1 . 12 ) 


where the proportions w(x’\ x it F,a) are calculated as 


w(x'\ Xi ,R<r) = ** 1 *«’ El exp (2^ fa ~ w )) 2 ) . 

I *<» F ) (^(W “ o(XiX f , w)) 2 ) 

Compare the two sums in Equation (1.12). In the second, the term for the completion x,x' 
appears weighted by the proportion w(x , \x iy F,<t) but in the term for the first sum each 
(already complete) pattern z,* is effectively weighted by 1. For this reason we say each 
completion participates as a partial pattern. Notice, also that the proportions w(x*\ x*, P, <r) 
essentially pick out that completion x* giving lowest squared error. Since the derivative of 
o(x,*x', tu) w.r.t. the completion x' is available after the back-propagation step, we could 
dynamically alter the completions x r during each learning step so their mean-square error 
become lower. 

The derivative of Cost w.r.t. <7 is in a similar form but this time a fixed-point equation 
(a also occurs on the right-hand side) for a can be derived: 


0 - Jf+l ( 53 (y< -o(*i,w)) 2 + 53 53 w (*'l x i, F ,<r)-(yi -o(xi x ',w 

\*6 comp/etc t€ms*«in g x f ncompl(xi ) 

The standard deviation a can be updated iteratively along with the weights w in each cycle. 
Again we see the “partial patterns” alter the usual form for the variance. This is a common 
result in mixture models. 

A disadvantage of this method is that each pattern with missing values now produces 
a number of completed patterns, each of which must be run through the network. For 
instance, if every pattern has missing values and c different completions are used for each, 
then computation is increased by a factor of c. One way around this would be to have the 
network first learn for patterns with no missing values, and then to fine tune by including the 
remaining patterns. Of course, the extra computation should give improved performance 
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as c is increased due to the normative justification of the approach. Also, the completed 
patterns can be dynamically altered during training with little extra overhead, to produce 
completions with lower mean-square error. 


1.8 Conclusion 

This paper has covered Bayesian theory relevant to the problem of training feed-forward 
connectionist networks. We now sketch out how this might be put together in practice, 
assuming a standard gradient descent algorithm as used during search. 

For network training, the principle steps are as follows: 

1. Choose an appropriate network structure and size based on prior knowledge about the 
application (see for instance the discussion in [LeC89] regarding choice of network), 
and select a prior on the weights. Notice a small number of different structures could 
be selected and the method will the best select the best. 

2. Construct completions and their proportions p(x f \x iy F) for each training pattern with 
missing values. If stochastic training is used rather than epoch training, each set of 
completions should always be run through the network together so that the appropriate 
term in Equation (1.12) can be calculated. 

3. Train to a local minima w as per usual but incorporating the adjustments described for 
missing values in Section 1.7.3. Appropriate cost functions are given in Section 1.6.1. 
In principle, the weight variances, weights evaluation and predictions only apply if tu 
found is a true local minima to the cost function so epoch training might have to be 
used in the last few cycles to ensure this. 

4. Possibly, use the weight elimination strategy of Section 1.7.1 to force some weights to 
zero, and continue training the network with forced weights remaining at zero. 

5. Once a local minima is found, estimate the quality of the local minima by finding 
second derivatives for every training pattern and combining them in the Evaluation 
measure of Section 1.6.2. Calculation of second derivatives is described in [BW91b]. 

6. Do random restarts of the network to repeat the last three steps and find other local 
minima. The estimated weight variances ^ w ^ xjnear £ (™n,m) or I(w) and the evalua- 
tion measure Eval(tU) should be retained for each low cost local minima w saved. If 
several different network structures are being tried, then repeat the last three steps 
for these as well. 

7. Choose a few networks with the best evaluation (Eval(tu)). Notice that the networks 
or local minima should not be chosen on the basis of their generalization error (as 
given in Section 1.7.2) because the generalization error for a particular local minima 
w is estimated based on the assumption that the network structure is “correct” and 
the “true” weights are in fact quite near w. 

8. Estimate the generalization error (for out-of-sample prediction) using the Formu- 
las (1.8) or (1.10), possibly combined using the pooled versions at the end of Sec- 
tion 1.7.2. Now that a high posterior structure and set of weights has been chosen, 
the assumptions behind the formula are reasonable. 
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Once some local minima have been found, inference can be done on new patterns. Rele- 
vant approximate predictions are described in Section 1.7.2 that apply if the weight variances 
are small. Standard inference does forward propagation of the inputs for the new pattern 
to obtain the output. The adjustments described involve approximation of output posterior 
variance of the output (error bars), better approximation of expected output and variance, 
and averaging over multiple local minima. 

These adjustments to the standard back-propagation procedure increase computation 
during back-propagation in most cases by at most a factor. Subsequent inference such as 
calculating variances require matrix calculations that can be done using fast standard matrix 
packages. While these methods come with the normative backing of Bayesian statistics, 
implementation often reveals lessons on how the various approximations and optimizations 
could be better made. One important issue here is the quality of the Gaussian approximation 
to the posterior of the weights. Since the whole approach rests on this, more evaluation and 
experience is required. A smooth transition also needs to be developed between Bayesian 
and uniform convergence methods to handle those cases where training samples become 
larger. 
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Chapter 2 

Calculating Second Derivatives 
on Feed-Forward Networks 


Recent techniques for training connectionist feed-forward networks require the calcula- 
tion of second derivatives to calculate error bars for weights and network outputs, and 
to eliminate weights, etc. This note describes some exact algorithms for calculating 
second derivatives. They require at the worst case approximately 2 K back/forward- 
propagation cycles where K is the number of nodes in the network. For networks with 
two-hidden layers or less, computation can be much quicker. Three previous approxi- 
mations, ignoring some components of the second derivative, numerical differentiation, 
and scoring, are also reviewed and compared. 


2.1 Introduction 

Recent improvements to back-propagation methods for training neural networks have con- 
sidered several tasks: speeding up learning using more sophisticated gradient descent al- 
gorithms [BL88, EJM90]; eliminating insignificant weights in order to find networks of an 
optimal size [LDS90]; the placing of error bars on the weights (to determine which weights 
are poorly estimated from the data and are perhaps suitable for elimination) and on the 
network outputs [DL91, Mac91]; and comparing the “posterior” quality of weights in dif- 
ferent networks trained on the same data [BW91a, Mac91], the MDL principle and related 
encoding techniques are generally considered to be approximations of these [BB88, BW91a]. 
These techniques share one requirement in common: they all require calculation of sec- 
ond derivatives of the energy function or network output w.r.t. the network weights. The 
first derivatives are of course calculated during standard back-propagation. Le Cun et ah 
have suggested an approximate calculation that ignores certain terms [LDS90] and MacKay, 
sometimes finding this too inaccurate, used a more costly but accurate method of numeri- 
cal differentiation [LDS90, Mac91]. Second derivatives are used by MacKay and their use 
is suggested by Buntine and Weigend [BW91a] because they have a relationship with im- 
portant quantities such as the posterior variance of the network weights, i.e. “error bars”, 
and the description length of a set of weights used in evaluating the quality of the set of 
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weights. Because these quantities are only required at the end of network training, rather 
than during each training cycle, their calculation does not have to be very efficient. Also, 
some of the above methods require the complete set of second derivatives, some require only 
the diagonal entries, and some approximations use block diagonals or other subsets of the 
full set. 


2.2 Notation 

The networks we consider consist of a directed acyclic graph, giving the network structure, 
together with the activation functions at each node. A (directed) connection from node n 
to m represents that node m receives node n’s activation during the inference stage, often 
multiplied with a weight w myn . Let K denote the number of nodes in the network. Let the 
set A denote the set of node pairs corresponding to a connection, so (n, m) E A denotes 
that there is a connection from nodes n to m. Let A * be such that (n, m) £ A* if and only 
if there is some sequence of connections in A from n to m. The sequence may be null so 
that (n, n) E A* by default. This is referred to as the transitive closure of A . For instance, 
if the network is layered and fully connected between layers, then (n,m) E A* will hold 
whenever n is in a lower numbered layer than m or n is equal to m. The activation of any 
node n E Af is denoted u n which is a real number. The activation function for the node 
mis a function of the activations ix* for nodes n inputting to node m, and the function is 
parameterized by a vector of parameters w m . The activation function for a node n can take 
the following forms (for some functions <f> n ^ and /„) 

Un = /»(#«,Wn,o) where v„ = ^ ^n,*(uk,t o n <k) (quasi-linear input) , 

(*,»)€.* 

u» = /„( t»„) where v„ = w„, 0 + ^ (linear input) . 

(fc,n)€*4 

The linear input form corresponds to the standard linear or sigmoid activation functions 
and the quasi-linear input form corresponds to radial basis nodes which in general take a 
weighted sum of squared differences between inputs and some “mean”. One exception to 
these forms is Softmax [Bri89]. The normalizing effect of Softmax nodes is readily absorbed 
into the energy or cost function for training, so these nodes are not necessary in the network. 

Second derivatives axe to be calculated of the cost function or energy function summed 
across all training patterns. Some standard cost functions are mean-square error or croes- 
entropy. So we are interested in the sum of terms for each training pattern of the form 

d 2 E 

dw mii dw n j 

where E is the cost function or energy function for a single pattern, which itself is a function 
of the network outputs for the pattern. 


2.3 Exact calculations 

This section presents an exact algorithm for calculating second derivatives of the energy 
or cost for a single pattern. Basic derivatives can be calculated as follows. This assumes 
nothing about the form of the activation functions (and follows directly from the chain rule). 
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Lemma 2.3.1 Assume a neural network framework as set up previously with the network 
a directed acyclic graph. Consider second derivatives of E which can be any function of 
activations from the output nodes. Equation (2.1) assumes that node n is not an output 
node. Equation (2.2) assumes there is no sequence of connections from m to n (i.e. u„ is 
not a partial function of u m ). 

8 2 E 


BUmdUn 

d^E 

dw mti dw„j 


d 2 E du, yr dE d 2 u t 

/ n* j d Umdu, du n + ,“. du, du m du„ 
(n,/)6 A (n,l)€A 


E 

d 2 E 


( 2 . 1 ) 


du m du n 


du m du n dw mt i dw„j 
dE dUn y 


+ lm=» 


dE d 2 u m 

dUfn dWj„;,idw mi j 

d 2 u m du, 


du m dw„j ~ dw m<i du, du„ 


(2.2) 


where the indicator function l ms£n takes the value 1 if m = n and 0 otherwise . 

Notice the final summation is only present if there is a sequence of connections from node 
n to node m. 

Corollary 2.3.1 Assume that for every node activation input is quasi-linear . Consider 
how the above second derivatives evaluate. Equation (2.8) assumes there is no sequence of 
connections from m to n. 

&E _ du m du„ dE_ du m d 2 v m du , du„ 

dwm,idw n j ’ dw m ,i dw„j ^ du m dv m du,dw mt , du n dw n j 


(2.3) 


+ li 


1 . .. dE ( d2u ”> / du ” / 

•j(» J)-° 3 Um ydw m ,idw m j / dw m>i / dw, 


"I" lfn^nli=0 


dE 

du m 


( d 2 u m / dum j dum _ d 2 u m j ( du m \ 2 \ dum dum du n 
\dv m dw m ,i / dv m / dw m j du*, j \dv m J J du n dw m ^ dw n j 


d 2 u m 

dv 2 m 


/ fe) ! ) 


du m du n 
dw m ,i dw n j 


where denotes i or j is zero ori = j, and 

*Rm,n = 


G m — 


E 

dE d 2 u m 


„ du, du, 
G ,- — — — + 


du m du„ 

2 


d u m d 2 v m 
and when n is not an output node 

Rm,n 


/(£)’ 


(m,/)€*4 


d 2 E du, duk 
du,dut du m du n 

dE du, d 2 v, 
du, dv, du 2 , ’ 


E 

l , k € ip%t -n ode $ 


= + E *-.r d « 


'dUn 


(2.4) 

(2.5) 

( 2 . 6 ) 


Proof First, when there is no sequence of connections from m to n, from Equation (2.1) 
we can prove by induction that 


d 2 E 

du m du n 


dum 

du n 


E dE du, d 2 v i 

~ r\ o o T 




du\ dvi du l 


G, 


d 2 E du, duk 


du, du, 
du m du„ 


l % k£oHtp*t-node$ 


du,duk du m du„ 
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If we now define 

___ d 2 1 5 dE d 2 u m j f du m \ du m ^2 y\ 

*“ du m du n du m dv ^ j \dv m ) du n 

we have proven Equation (2.6) for when there is no sequence of connections from m to 
n. Substituting in the definition of Rm y n into Equation (2.3), and cancelling out terms 
almost gives us Equation (2.2). Equivalence of Equations (2.3) and (2.2) is shown by first 
substituting into Equation (2.2) 

V' d / du m dv m \ dui 

(l,Z)€A dWm ' idU,dUn ~ \ 9Vm 9Ul > 9Un 

— ^ Um + du m d 2 v m dui 

~ dw m%i dv m du n dv m dw m ^dui du n 

Substituting in the value of Rm yn into Equation (2.3), noting that the indicators l*j(«-j)=rO 
and l t= o are unnecessary in the last two terms, and rear ranging proves equivalence. Also, 
we also get Equation (2.4) by induction from Equation (2.6). Notice Equation (2.4) is 
symmetric in n and m so we can drop the conditions on n and m for this equation. A final 
induction proof then shows Equation (2.6) for when there is a sequence of connections from 
m to n. □ 


Notice that if nodes use activation functions with linear input such as a sigmoid, then 
the corollary simplifies further because — 0, m = 1 if ( m > 0 £ A, etc. 

Corollary 2.3.2 Assume that for every node activation input is linear . Then the formula 
in Corollary 2.8.1 evaluate to , where Equation (2.8) assumes there is no sequence of con- 
nections from m to n: 


d 2 E 

dw m ,idw n j 


dE d 2 ^ / (dUrnV 
du m dv m / \dv m J 

dum (hin . dE^ du m duj du n 

’* dw m< i dw„j '*°du m dv m du n dw n j 


For most energy functions, the second summation in Equation (2.4) is easy to compute 
since usually — 0 for i ■£. j. For instance, for mean-square error on multiple outputs, 

we get 




l£input-nodes 


dui du\ 
dw mii dt»i 


4* constant 




To calculate second derivatives using any of the above equations, all first derivatives 
between nodes have to be computed, that is, we need for all (m, l) € A*. This applies 
even if we are only interested in the diagonal terms of t£e second derivative. Notice that 
for networks with two or less hidden layers, all these first derivatives will be available once 
all first derivatives of output nodes are computed. First derivatives for a fixed l can 
be calculated by backward propagation of derivatives from l. Derivatives for a fixed m 
can be calculated by forward propagation of derivatives from /. 
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Also, notice that if node m is an output node, then l in the summation in Equation (2.4) 
can only be m so Rm, n can be computed in constant time, assuming the required first 
derivatives exist. Second derivatives are therefore only expensive to calculate exactly if 
both weights are in hidden nodes. Similarly, second derivatives are quite efficient to calculate 
exactly if one of the weights is at most one or two node away from an output node. 

We therefore get the following algorithm for exact computation of second derivatives: 

1. Calculate G m for each node m. (This takes in the order of one back-propagation cycle 
to do.) 

2. For each node n calculate Rm in for relevant nodes m: 

(a) Forward propagate from n to calculate derivatives for each node m using 
the standard formula 

du m __ dm du m 

du n " du n dui 

(m,l) 6.4 

If we are only interested in Rm,n for n = m, then compute Rm,m using Equa- 
tion (2.4). 

Else, backward propagate starting from the output nodes to calculate Rm^ for 
each node m using Equation (2.6) for the recursive case and Equation (2.4) for 
the base case. 

This requires at the worst case approximately 2 K back/forward-propagation cycles. Any 
required second derivative can now be read from the stored calculations in constant time. 

Notice that for the special case where we are only after the block diagonals ' 

for n = m, of a network with at most two hidden layers, then calculation of Rn yn for every 
node n takes only (approximately) three back-propagation cycles since first derivatives can 
be calculated with one back-propagation cycle, and each Rn, n can be calculated directly 
from Equation (2.4). 


(b) 

(c) 


2.4 Approximations 


If we assume nodes m and n Me at the same level, then from Equation (2.1) we get 


<PE 

du m du n 


E 

(m,l),(n,k)€A 


d 2 E 3ui dut 

dutduk du m du n 


+ E 


d_E_ 3 2 n, 
dtij du m du„ 


Le Cun et aV s approximation corresponds to the case where we have linear input nodes and 
we assume = Ri,k = 0 for nodes l ^ k with l and k at the same level, giving the rule 

(they use a different parameterization so their form is different): 


d 2 E 

= E 1 

(m, l)$A 

(d 2 E 

[duf 

(S) + 

dE d 2 u, 

du\ dvf 

d 2 E 

( ^ E l 

f du m 

\ 2 , dE d 2 u m 

\ , 

0™m,i 

1 

Kdv m 

) + du m dvl 
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These can be computed with one back-propagation cycle so calculation is efficient [LDS90]. 
MacKay, however, found the method inaccurate for his purposes [Mac91] and instead 
dropped the first summation from Equation (2.4) and all but the first term from Equa- 
tion (2.3) leaving a calculation that requires only the first derivatives, 


8 2 E 

dw mt idw n j 


E 

l } k ^output-nodes 


d 2 E dui duk 
duidu k dw mt i dw n j 


He reports this is inaccurate when approximating the determinant of the matrix of second 
derivatives, or looking at individual second derivatives, but seems fine when approximating 
the trace of the matrix. This approximation, ignoring the second derivatives, corresponds 
to the Levenberg-Marquardt approximation in non-linear least squares [PFTV88, p523]. 
Clearly, better approximations involving a few more terms are available. 

A second form of approximation exists if the network cost function being minimized 
corresponds to the negative logarithm of the likelihood of the training sample, as is often the 
case when using mean-square error or cross-entropy cost functions [BW91a, BW87, EJM90]. 
Suppose the likelihood of the training sample of size D is given as a product over patterns 
(xj, Vd) in the sample 

p(y\x,w) = JJ l(y d | x d , w) , 

d=i,...,v 

where /(y,* | x,*, tu) is the likelihood of the d-th pattern and is a function of the network output 
for input x, and the “correct” output y,-. Then maximum likelihood training (maximum a 
posterior training is similar) corresponds to minimizing the energy function with component 
for each pattern 

E d = -log l(y d | xa, w) . 

For instance, when using the mean-square error cost function on a single output, we are 
implicitly using the Gaussian likelihood 

l(yd\x d ,w) = ^exp (~^a(w -<*)*) ’ 


where <r is the standard deviation and Od is the network output for input x<*. 

When network weights are near a local maxima of the likelihood the second derivatives 
can be approximated using the following formula: 


E 


&Ej 

9Wm t i dw m ,j 


d 2 - log p(y | x,w) 
dw m ,idw m ,i 


E 

sl,..., 

E 

-1 

■*f M i 

E 




d 2 - log f(y d | id, w) 
dw mii dw m ,j 

dlogljM | x d ,w) dlogl(y d | x^w) 
dw m>i dw m j 

dEj dEj 

dw m<i dw m<j 


(2.9) 


Notice this only requires calculation of the first derivatives of the energy. The approximation 
step is justified because at a local maxima, on the assumption that the weights w are 
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approximately correct (which they may well not be) and the sample size D is large [Ame85, 
p.14], the twosides are approximately equal. This is the approximation used in the “scoring” 
method of maximum likelihood training [Ame85]. This approximation is best used during 
search when fast estimates of second derivates are required. The approximation would be 
misleading when after good approximations for error bars because it assumes that the thing 
we are attempting to evaluate is approximately true (that the weights are correct). 

A third more exact approximation of second derivatives can be done by numerical dif- 
ferentiation of the first derivatives, which in turn can be calculated using standard back- 
propagation. This corresponds to 

d 2 E(w) ^ 1 /dE(w + Aw mt i) dE(w)\ 

dw mi idw n j ~ V dw n j dw n j ) 

If there are |iy| different weights in the network, then this requires |tz>| + 1 back-propagation 
cycles to compute the necessary first derivatives (compared to 2 K back/for ward- propagation 
cycles for the exact calculation, where K is the number of nodes). This method has been 
used in the case where there is a small number of weights by MacKay [Mac91]. 

A more efficient approach is to use numerical differentiation to calculate second deriva- 
tives of the activations, , in 2T+1 back-propagation cycles, and to calculate additional 

first derivatives as required in approximately ^ back-propagation cycles, and then use 
Equations (2.3) and (2.7). A similar method is suggested in [BL88]. This approximation 
is therefore of the same computational order as the exact calculations described previously, 
but requires little additional algorithm overhead other than the back-propagation algorithm. 
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